Feature/4.0/helm installer #1064

Open
jorgemoralespou wants to merge 152 commits into educates:develop from jorgemoralespou:feature/4.0/helm-installer

@jorgemoralespou

Work on migrating the Carvel installer to a Helm installer.

Adds the v4 architecture-of-record under docs/architecture/:
  - educates-current-state.md  — what v3 actually is today.
  - educates-v4-development-plan.md  — phased plan + open items + the
    pre-phase chart workstream this commit-set lands.
  - educates-crd-draft-v1alpha1-r3.md  — operator CRD design (informs
    Phase 0+).
  - decisions.md  — append-only decisions log; entries grouped by
    topic with reconsider triggers where relevant.

CLAUDE.md is the briefing for any future Claude Code session in this
repo: scope of v4 vs v3, what's safe to touch, working norms,
references back to the architecture docs.

.gitignore picks up .claude/ so user-local agent state doesn't leak.

Pre-phase deliverable from docs/architecture/educates-v4-development-plan.md.
The chart is the canonical Helm install for the Educates v4 runtime —
what users helm install today (and what the v4 operator will install
on their behalf in Phase 4).

Layout:
  installer/charts/educates-training-platform/
  ├── Chart.yaml, values.yaml, .helmignore        (umbrella)
  └── charts/
      ├── secrets-manager/        (CRDs + operator: cross-NS secret
      │                            propagation primitives)
      ├── lookup-service/         (federation API; off by default)
      ├── remote-access/          (read-only RBAC + token Secret for
      │                            external CLI clients; toggleable
      │                            independently)
      └── session-manager/        (workshop runtime + bundled v3
                                   Kyverno policies)

CRDs ship in each subchart's `crds/` directory rather than templates/
because Helm can't apply CRDs and CRs of those CRDs in a single
release otherwise (see decisions.md).

session-manager ships v3-vendored Kyverno policies on two paths:
  - bundledKyvernoPolicies.clusterPolicies — Pod Security Standards
    profiles installed cluster-wide as ClusterPolicy resources.
  - bundledKyvernoPolicies.workshopPolicies — operational policies +
    Educates-internal `require-ingress-session-name`, written into
    the educates-config Secret for session-manager to clone per
    workshop environment.
additionalKyvernoPolicies.{clusterPolicies,workshopPolicies} let
admins extend either bundle through chart values — net-new vs v3,
which required out-of-band kubectl-apply.

Image refs default to ghcr.io/educates/educates-* (v3 naming
convention). Tag default is the chart's appVersion; scenarios pin to
3.7.1 because no v4 runtime images exist yet.
…lates

Stray file from troubleshooting .Files.Glob behaviour in
kyverno-cluster-policies.yaml during chart development. Helm treats
templates/_*.txt as a partial-style include and skips it during
render, so it never affected output — but it shouldn't be in the
chart.

Six end-to-end scenarios under installer/charts/educates-training-platform/tests/.
Each scenario provisions a kind cluster (cluster-only), runs an
optional pre-install hook to stage cluster-side fixtures, applies the
chart, runs an optional post-install hook before rollout-status, and
exercises the deploy-workshop / browse path.

run-scenario.sh wires four hook points (pre-install, post-install,
post-deploy, teardown) so each scenario carries its own setup +
assertions. .env loading + envsubst rendering of scenario files lets
the runner pick up DOMAIN, TLS_CERT_PATH, CA_CERT_PATH from the user
shell (handy with mkcert auto-detection) without touching scenario
files.

Scenarios:
  01-local-http-nip-io          — minimal HTTP smoke; everything
                                  optional off.
  02-kind-tls-wildcard          — offline-generated wildcard +
                                  secretPropagation.upstream paths.
  03-kind-cert-manager-issuer   — cert-manager + certs package
                                  issues the wildcard from a
                                  user-provided CA.
  04-website-theme              — custom websiteTheme reaches the
                                  live portal HTML (post-deploy
                                  curls portal URL, greps marker).
  05-image-pull-secrets         — auth'd local registry serves a
                                  copy of educates-session-manager;
                                  scenario fails if the chart's
                                  pull-secret chain breaks.
  06-additional-kyverno-policies — bundled + user-supplied
                                  ClusterPolicies on both paths
                                  (cluster-wide + per-environment).

Step 5 prints a click-through portal URL with URL-encoded password
(uses a pure-bash percent-encoder).
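
As an illustration of the percent-encoding step, here is a minimal Python sketch of the same idea. The runner itself uses a pure-bash encoder; the function name `portal_login_url`, the example host, and the `password` query parameter are assumptions made for this example, not the runner's actual URL shape.

```python
from urllib.parse import quote


def portal_login_url(base_url: str, password: str) -> str:
    """Build a click-through URL with the password percent-encoded.

    Hypothetical URL shape: the real runner may place the password elsewhere.
    """
    # safe="" encodes every reserved character, including "/" and "&",
    # so the password cannot be mistaken for URL structure.
    return f"{base_url}/?password={quote(password, safe='')}"


print(portal_login_url("https://portal.example.test", "p@ss w/rd&x"))
```

Encoding with an empty `safe` set is the conservative choice here: a generated password may contain any printable character.
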
Add the docs-of-record for the session-manager subchart's typed values
shape (values.yaml + JSON schema), and update the v4 development plan
to mark Kyverno-policy bundling as done and rescope the
typed-runtime-config follow-up against the broader target shape.

Refactor the session-manager subchart's values surface from an opaque
config blob plus ad-hoc toggles into the typed shape defined by the
docs-of-record (docs/architecture/session-manager-chart-values{.yaml,
-schema.json}). Adds values.schema.json so Helm catches shape errors
at template time.

Key changes:

- clusterIngress / clusterSecurity / workshopSecurity replace the
  config: clusterIngress + bundledKyvernoPolicies + openshift.enabled
  knobs. policyEngine and rulesEngine are PascalCase enums in chart
  values, lowercased when emitted into the runtime config blob to
  match the runtime's existing expectations (no runtime change).
- Promotes imageRegistry, imageVersions, sessionCookies, clusterStorage,
  clusterRuntime, clusterNetwork, dockerDaemon, workshopAnalytics, and
  websiteStyling.{defaultTheme,frameAncestors} out of config into typed
  values; chart auto-injects operator.namespace (release ns) and
  version (chart appVersion).
- websiteStyling.inline.{workshopDashboard,workshopInstructions,
  workshopStarted,workshopFinished,trainingPortal} replaces the flat
  websiteTheme map; the chart maps the structured triples to the
  flat secret keys (.html / .js / .css) the runtime expects.
- SecretCopier rules for ingress TLS+CA are auto-derived from
  clusterIngress.{tls,ca}CertificateRef.namespace; the explicit
  secretPropagation.upstream.ingressTLS / ingressCA knobs are gone.
  themeDataRefs entries with non-local namespaces are also auto-copied.
- Materialises empty-string TLS/CA refs in the operator-config blob
  even when unset, papering over the runtime's xget-no-default-None
  quirk.
- config remains as an opaque escape hatch, deep-merged on top of the
  typed-derived blob so users can land new fields before promotion.
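
The escape-hatch behaviour can be sketched in Python as a recursive merge in which the opaque `config` map wins on conflict. This is an illustration of the semantics described above, not the chart's Helm template code; `deep_merge` and the sample values are assumptions, and the sketch assumes non-map values (including lists) are replaced wholesale.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict; override wins on conflict, nested maps are merged."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced, not combined (assumption).
            merged[key] = copy.deepcopy(value)
    return merged


# Blob derived from typed values; the PascalCase enum is lowercased on emit.
typed_derived = {
    "clusterSecurity": {"policyEngine": "Kyverno".lower()},
    "dockerDaemon": {"networkMTU": 1400},  # illustrative value
}
# Opaque `config:` map supplied by the user (illustrative values).
escape_hatch = {
    "dockerDaemon": {"networkMTU": 1200},
    "experimental": {"markerKey": "marker-value"},
}

runtime_config = deep_merge(typed_derived, escape_hatch)
print(runtime_config)
```
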

Scenarios 01 (local HTTP nip.io) and 02 (kind + TLS wildcard) are
updated to the typed shape and verified via helm template. The
remaining scenarios (03-06) move to the typed shape in a follow-up
commit.

Brings scenarios 03-06 onto the typed values shape introduced in the
previous commit and adds scenario 07 to exercise the `config:` escape-
hatch deep-merge.

- 03 (kind cert-manager issuer): cross-namespace TLS/CA refs now drive
  the SecretCopier auto-derive instead of an explicit
  `secretPropagation.upstream.ingressTLS/ingressCA` block.
- 04 (website theme): uses `websiteStyling.inline.trainingPortal.{html,
  style}`, exercising the chart's structured-triple → flat-secret-key
  mapping (`style` → `.css`).
- 05 (image-pull secrets): typed top-level `imagePullSecrets` for the
  PodSpec; `secretPropagation.imagePullSecretNames` and
  `secretPropagation.upstream.imagePullSecrets` unchanged.
- 06 (additional Kyverno policies): toggles renamed —
  `clusterSecurity.{policyEngine,additionalKyvernoPolicies}` and
  `workshopSecurity.{rulesEngine,additionalKyvernoPolicies}` replace
  the old `bundledKyvernoPolicies` / `additionalKyvernoPolicies.{cluster,
  workshop}Policies` blocks.
- 07 (new): asserts the `config:` opaque map deep-merges on top of the
  typed-derived runtime config and wins on conflict (`dockerDaemon.
  networkMTU` override) and passes through unknown fields untouched
  (`experimental.markerKey`).

Drive-by: schema's top-level `imagePullSecrets` was [string] but the
PodSpec wants the standard k8s [{name: ...}] shape — fixed in both the
chart's values.schema.json and the docs-of-record.

Records the decision behind the typed-values refactor (commits
a57ef864 / d339c016 / f61d3c8e): a single typed surface serves both
the operator and standalone chart users, with `config:` retained as
an escape hatch for not-yet-promoted runtime fields. The earlier
"operator-driven, not v3-driven" decision is superseded — its framing
left standalone users writing opaque YAML against the v3 schema from
memory, which contradicted the chart's own publish-as-canonical-Helm
install positioning.
…default imageVersions

Fixes ErrImagePull on training-portal (and any other runtime-spawned
child image) by stopping the chart from auto-injecting Chart.AppVersion
(`4.0.0-alpha.1`) as the runtime config's `version` field. Restores the
defaulted `imageVersions` list that v3's carvel installer rendered out
of `config/images.yaml`, which the previous refactor silently dropped.

- New typed value `runtimeVersion: "3.7.1"` drives:
    * the chart-pod `image.tag` default
    * the `imagePuller.pauseImage.tag` default
    * the `version` field auto-injected into the operator-config blob
- `imageVersions` defaults to the full v3 list: 12 Educates-published
  images pinned to `runtimeVersion`, 7 upstream pins (docker-in-docker,
  loftsh-kubernetes-v1.31..34, loftsh-vcluster, debian-base-image) at
  their v3-vendored tags. Visible in values.yaml so chart users see
  exactly which images the runtime can pull.
- Drops the redundant `image.tag: "3.7.1"` overrides in scenarios 01-07
  (chart default now resolves correctly). Scenario 05 keeps its
  `image.repository` override for the auth'd local registry but no
  longer pins the tag.
- Mirrored in docs-of-record (values.yaml + JSON schema), the v4 dev
  plan, and a new decisions.md entry covering the rationale and the
  documented two-place-edit when bumping `runtimeVersion`.
…helper-defaulted imageVersions with per-key merge

Supersedes the runtimeVersion + populated-values approach from e71e41af
with a more idiomatic shape:

- `Chart.appVersion` IS the runtime image version. Set to "3.7.1"
  across the umbrella + four subchart Chart.yamls (was "4.0.0-alpha.1"
  — which doesn't exist as a published image and was the original
  source of the ErrImagePull). `Chart.version` stays at 4.0.0-alpha.1
  (the chart-package version). They normally move together at release
  time; the field separation exists for chart-only patches.
- Removed `runtimeVersion` typed value. Image-tag defaults
  (`session-manager.image.tag`, `imagePuller.pauseImage.tag`) and the
  operator-config blob's `version` field all source `Chart.appVersion`
  directly.
- The full default `imageVersions` set moves into a new template
  helper `session-manager.imageVersions` (mirroring v3's
  carvel-installer images.yaml). Educates-published entries derive
  their tag from `Chart.appVersion`; upstream pins (docker-in-docker,
  loftsh-*, debian-base-image) are hard-coded.
- User-supplied `.Values.imageVersions` entries merge BY NAME on top
  of the helper defaults: an override replaces just the matching
  default's image, other defaults pass through, names not in the
  default list are appended (forward-compat). Strictly better UX than
  v3's full-list replacement; chart users override only what they need.
- `values.yaml`, `values.schema.json`, and the docs-of-record return to
  `imageVersions: []` with comments documenting the per-key merge
  semantics. The helper is the documented inventory.
- Dev plan updated. decisions.md entry replaced with the corrected
  rationale (Chart.appVersion sourcing, helper-defaulted list, per-key
  merge UX).
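
The per-name merge can be sketched in Python as follows. This is a stand-in for the template helper, not its implementation; `merge_image_versions`, the entry names, and the image references are illustrative assumptions. Only the `name` / `image` entry shape and the three rules (replace on match, pass through otherwise, append unknown names) come from the description above.

```python
def merge_image_versions(defaults, overrides):
    """Merge user overrides onto default entries, matching by `name`."""
    merged = [dict(entry) for entry in defaults]
    index_by_name = {entry["name"]: i for i, entry in enumerate(merged)}
    for override in overrides:
        name = override["name"]
        if name in index_by_name:
            # Matching default: replace only its image.
            merged[index_by_name[name]]["image"] = override["image"]
        else:
            # Unknown name: append for forward compatibility.
            index_by_name[name] = len(merged)
            merged.append(dict(override))
    return merged


# Hypothetical defaults; the real inventory lives in the chart helper.
defaults = [
    {"name": "training-portal", "image": "registry.example.test/educates-training-portal:3.7.1"},
    {"name": "docker-in-docker", "image": "registry.example.test/dind:placeholder"},
]
overrides = [
    {"name": "docker-in-docker", "image": "mirror.example.test/dind:placeholder"},
    {"name": "extra-tool", "image": "mirror.example.test/extra-tool:placeholder"},
]

for entry in merge_image_versions(defaults, overrides):
    print(entry["name"], "->", entry["image"])
```
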
…-pod, pause, and runtime children

A chart user pointing at a fork or a local registry now redirects
every Educates-image reference with one knob. Previously only the
runtime-spawned children were derivable from imageRegistry — the
chart pod and the pause image were hard-coded to ghcr.io/educates/...,
which broke the dev workflow against a fork.

- `imageRegistry.host` / `.namespace` default to `ghcr.io` / `educates`
  (was empty/empty). They now compose the prefix used by:
    * `session-manager.imageRegistryPrefix` helper (new)
    * the chart-pod `image.repository` default (when empty)
    * the pause image `imagePuller.pauseImage.repository` default
      (when empty)
    * the Educates-published entries in the `session-manager.imageVersions`
      helper
- Upstream pins (docker-in-docker, loftsh-*, debian-base-image) are NOT
  relocated by imageRegistry — they're public upstream images that don't
  follow the educates-<name> naming convention. Mirror them via per-entry
  imageVersions overrides instead.
- `image.repository` and `imagePuller.pauseImage.repository` defaults
  flip from hard-coded refs to empty strings; helpers `session-manager.
  image.repository` and `session-manager.pause.image.repository` resolve
  the empty-derives-from-imageRegistry behaviour. Schema's `imageRef`
  drops the minLength on `repository` to allow the empty.
- Helper bails fast (`fail`) if imageRegistry.host is empty — the
  chart can't compose a default ref without it.
- Verified end-to-end: setting `imageRegistry: {host: localhost:5001,
  namespace: educates-fork}` redirects 12 runtime children + chart pod
  + pause to that prefix, while upstream pins stay put.
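
A rough Python sketch of the empty-derives-from-imageRegistry rule follows. The function names are hypothetical and the real logic lives in Helm template helpers; the sketch also assumes an empty namespace simply drops that path segment, which the description above does not state.

```python
def image_registry_prefix(host: str, namespace: str) -> str:
    """Compose the registry prefix, failing fast when the host is empty."""
    if not host:
        raise ValueError("imageRegistry.host must not be empty")
    # Assumption: an empty namespace is allowed and omitted from the prefix.
    return f"{host}/{namespace}" if namespace else host


def resolve_repository(explicit: str, host: str, namespace: str, short_name: str) -> str:
    """An explicit repository wins; otherwise derive <prefix>/educates-<name>."""
    if explicit:
        return explicit
    return f"{image_registry_prefix(host, namespace)}/educates-{short_name}"


print(resolve_repository("", "localhost:5001", "educates-fork", "session-manager"))
```

Upstream pins are deliberately outside this function: they do not follow the `educates-<name>` convention, so they would be overridden per entry instead.
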

Mirrored in docs-of-record (values.yaml + JSON schema), v4 dev plan,
and the prior decisions.md entry.
…curity to umbrella globals

Cross-cutting deployment-scope values now live at the umbrella under
`global:`. Helm propagates them to every subchart as `.Values.global.<key>`.
Subcharts retain a local block of the same name with sensible defaults;
new helpers deep-merge the umbrella global over the subchart local, with
globals winning per-leaf where set. Subcharts remain independently
installable.
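
The "globals win per-leaf where set" rule can be sketched in Python as below. This is an illustration only; `overlay_set_leaves` is a hypothetical name, and treating `None` and the empty string as "not set" is an assumption about how the helpers decide whether a global leaf applies.

```python
import copy


def overlay_set_leaves(local: dict, global_values: dict) -> dict:
    """Overlay global values on subchart-local ones, leaf by leaf.

    A global leaf overrides the local one only when it is set
    (assumed here to mean: neither None nor an empty string).
    """
    result = copy.deepcopy(local)
    for key, value in global_values.items():
        if isinstance(value, dict):
            base = result.get(key)
            result[key] = overlay_set_leaves(base if isinstance(base, dict) else {}, value)
        elif value not in (None, ""):
            result[key] = copy.deepcopy(value)
    return result


subchart_local = {
    "clusterIngress": {"domain": ""},
    "clusterSecurity": {"policyEngine": "Kyverno"},
}
umbrella_global = {
    "clusterIngress": {"domain": "workshops.example.test"},  # set: wins
    "clusterSecurity": {"policyEngine": ""},  # unset: local default stays
}

print(overlay_set_leaves(subchart_local, umbrella_global))
```

With an empty `umbrella_global`, the function returns the local block unchanged, which is the standalone-install case.
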

Concretely:

- session-manager: new `resolved{ImageRegistry,ClusterIngress,
  ClusterSecurity}` helpers feed `imageRegistryPrefix`, `derivedProtocol`,
  `operatorConfigYAML`, `kyverno-cluster-policies.yaml`,
  `clusterrolebindings.yaml`, and `secretcopiers.yaml`. Schema drops
  `clusterIngress` and `clusterSecurity` from `required` and removes the
  subchart-local `clusterIngress.domain` minLength — helpers do the
  post-merge `fail` instead.
- lookup-service: new `imageRegistry` block (default ghcr.io/educates),
  `image.repository` flips to empty (derives from imageRegistry),
  helpers added (`resolvedImageRegistry`, `imageRegistryPrefix`,
  `image.repository`). Wired into the Deployment. New
  `values.schema.json`.
- secrets-manager: same image-helper additions. `openshift.enabled`
  removed; the SCC ClusterRoleBinding now gates on
  `clusterSecurity.policyEngine == "OpenShiftSCC"` (resolved from
  globals when present). New `values.schema.json`.
- Umbrella `values.yaml` gains a `global:` block with commented examples
  for the three keys.
- All seven scenarios converted to canonical-globals shape: cross-
  cutting values live under `global:`, subchart blocks shrink to per-
  subchart concerns only. Verified end-to-end — setting
  `global.imageRegistry.{host,namespace}` redirects all chart pods +
  runtime children; `global.clusterSecurity.policyEngine: OpenShiftSCC`
  triggers SCC bindings in BOTH session-manager and secrets-manager.

Mirrored in the doc-of-record (note about dual-source pattern), the v4
dev plan (each cross-cutting block now flagged as a global), and a new
decisions.md entry covering the rationale and trade-offs.

Drive-by: `tests` added to .helmignore so scenario fixtures don't ship
with the chart package.
… render ca-trust-store init container

Drops lookup-service's specialised `caTrust` block and `ingress.tls`
field in favour of consuming `global.clusterIngress` (with subchart-
local fall-back). The lookup-service Ingress's TLS Secret now derives
from the resolved `clusterIngress.tlsCertificateRef.name` (typically
the wildcard cert covering `*.<domain>`), and the chart renders a
ca-trust-store init container when `clusterIngress.caCertificateRef.name`
is set.

- New `clusterIngress` block in lookup-service values.yaml mirrors the
  shape introduced in session-manager + the umbrella global.
- `caTrust.{secretName,initImage}` removed; the init image is no longer
  base-environment but the lookup-service main image itself (Fedora-
  based: has `update-ca-trust` and `tar`). Zero extra image pulls; the
  kubelet already has it on the node. Mirrors v3's
  overlay-ca-injector.yaml mechanism without the cost of pulling a
  multi-GB workshop image.
- `ingress.tls.secretName` removed; the Ingress derives TLS from the
  resolved `clusterIngress.tlsCertificateRef`.
- New `secretcopiers.yaml` auto-derives copy rules for both the TLS
  Secret and the CA Secret when their refs target a foreign namespace.
  Renders independently of session-manager's SecretCopier so this chart
  is installable standalone; under the umbrella both subcharts render
  their own rules (idempotent — same source-Secret copied once
  regardless of how many rules reference it).
- Helpers updated: drop `caTrust.image.{tag,pullPolicy}`, add
  `resolvedClusterIngress` and `caTrustEnabled`.
- Schema reflects the new shape.
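
The auto-derive condition for copy rules can be sketched in Python as follows. The rule dictionaries are a simplified, hypothetical shape and not the SecretCopier CRD schema; `derive_copy_rules`, the `wildcard-ca` name, and the `educates` release namespace are assumptions for the example.

```python
def derive_copy_rules(refs, release_namespace):
    """Emit a copy rule for each Secret ref that targets a foreign namespace."""
    rules = []
    for ref in refs:
        name = ref.get("name", "")
        namespace = ref.get("namespace", "")
        # No rule when the ref is unset or already local to the release.
        if name and namespace and namespace != release_namespace:
            rules.append(
                {
                    "sourceSecret": {"name": name, "namespace": namespace},
                    "targetNamespace": release_namespace,
                }
            )
    return rules


tls_ref = {"name": "wildcard-tls", "namespace": "educates-secrets"}
ca_ref = {"name": "wildcard-ca", "namespace": ""}  # local: nothing to copy

print(derive_copy_rules([tls_ref, ca_ref], "educates"))
```
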

Verified by enabling lookup-service against a TLS+CA scenario:
- Ingress emits `tls: [{secretName: wildcard-tls}]` with the
  hostname-specific `host` and the resolved cert name.
- Init container reuses the lookup-service image and runs
  `update-ca-trust && tar -C /etc/pki/ca-trust ...`.
- Main container mounts the CA-populated trust store at
  /etc/pki/ca-trust read-only.
- SecretCopier `educates-lookup-service-ingress-secrets` pulls both
  refs into the release namespace.
…ger Deployment

Mirrors what step 2 added to lookup-service: a chart-side ca-trust-store
init container that builds a CA-populated trust store from
`global.clusterIngress.caCertificateRef` and the main container mounts
the result at /etc/pki/ca-trust read-only. Reuses the main session-
manager image (Fedora-based: has `update-ca-trust` and `tar`) so no
extra image pull on the node.

v3 only injected the trust store into lookup-service. Including it in
session-manager too is harmless when the CA isn't needed and avoids
debugging "why does X fail TLS verify?" later if session-manager ever
gains code paths that reach external TLS endpoints fronted by the
private CA.

- New `session-manager.caTrustEnabled` helper + Deployment template
  rewire: initContainers / volumes / volumeMount render conditionally
  on the resolved `clusterIngress.caCertificateRef.name`.
- The init container's securityContext explicitly sets
  `runAsNonRoot: false` to override the pod-level `runAsNonRoot: true`
  enforcement (the trust-store update needs UID 0 to write
  /etc/pki/ca-trust). Mirrored in lookup-service for consistency.
- Verified: scenario 02 (TLS+CA) renders the init container; scenario
  01 (no CA) does not; existing SecretCopier auto-derive still pulls
  the CA Secret into the release namespace where the init container
  consumes it.
…chart

Brings remote-access in line with the schema discipline applied to the
other three subcharts (additionalProperties: false; only `enabled` and
the Helm-injected `global` are valid). The subchart has no configurable
knobs in v0.1.0 — the schema serves purely as a typo-catcher and a
contract that future additions are deliberate.

All seven scenarios still render; an additional smoke-render with
remote-access.enabled=true also passes.

Restores the v3 cluster-node CA injection feature as a sibling subchart
under the umbrella, replacing the never-rendered
`session-manager.clusterIngress.caNodeInjector.enabled` stub. Toggle is
the umbrella's `node-ca-injector.enabled: false` (default off, opt-in)
via Helm's standard subchart-condition mechanism.

What renders when enabled (mirroring v3's 07-node-ca-injector.yaml):

- ServiceAccount + ClusterRole/Binding (Ingress watch) +
  Role/RoleBinding (ConfigMap manage in release ns).
- `node-ca-injector-controller` Deployment running the `controller`
  subcommand — watches Ingresses, builds the `educates-registry-hosts`
  ConfigMap.
- `node-ca-injector` DaemonSet running the `sync` subcommand —
  privileged, mounts the CA Secret + hosts ConfigMap + hostPath
  `/etc/containerd/certs.d`. Writes per-host containerd registry-CA
  configuration so containerd trusts the cluster's private CA when
  pulling images.
- SecretCopier auto-derived when the CA ref's namespace is foreign.

The subchart consumes `global.clusterIngress.caCertificateRef` (with
subchart-local fall-back for standalone) and fails fast at template time
if the resolved CA ref is empty. Image is derived from
`global.imageRegistry` so the same fork/local-registry knob redirects
this subchart too. Has its own `values.schema.json`.

Relationship to the per-pod ca-trust-store init container (steps 2/3):
complementary, not overlapping. Init container handles in-pod TLS verify
for our own Deployments; node-ca-injector handles container-runtime-
level trust for image pulls (including pulls performed by pods we don't
render — third-party operators, docker-in-docker workshop sessions).
Both keyed on the same global CA ref; both independently togglable.

`session-manager.clusterIngress.caNodeInjector.enabled` is removed from
values.yaml + values.schema.json + the doc-of-record. New decisions.md
entry covers the rationale + the relationship between the two CA-trust
mechanisms.
… top-level shape

Closes the validation gap on the umbrella's cross-cutting `global:`
block. Subchart schemas correctly treat `global` as opaque (they
shouldn't dictate the umbrella's contract), which meant typos like
`global.clusterSecuirty.policyEngine` or `global.imageRegistry.namespece`
silently fell through — every subchart fell back to its local defaults
and the user's intended override was lost.

The umbrella schema:
- Validates the `global.{imageRegistry,clusterIngress,clusterSecurity}`
  shape with `additionalProperties: false` at every level.
- Forbids unknown top-level keys (catches misspelled subchart names
  like `sesion-manager:` that Helm would otherwise treat as inert).
- Treats each subchart block (`secrets-manager`, `lookup-service`,
  `remote-access`, `session-manager`, `node-ca-injector`) as
  `{ "type": "object" }` and delegates detailed validation to that
  subchart's own schema. No duplication.
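
To illustrate what `additionalProperties: false` buys, here is a hand-rolled Python stand-in that reports unknown keys. It is not JSON Schema and not the chart's schema file; `unknown_keys` and the `UMBRELLA_SHAPE` table are assumptions, with `None` marking a leaf or a block whose detailed validation happens elsewhere.

```python
def unknown_keys(values: dict, shape: dict, path: str = "") -> list:
    """Return dotted paths of keys that the shape does not allow."""
    errors = []
    for key, value in values.items():
        here = f"{path}.{key}" if path else key
        if key not in shape:
            errors.append(here)
        elif isinstance(shape[key], dict) and isinstance(value, dict):
            errors.extend(unknown_keys(value, shape[key], here))
    return errors


# Simplified shape: nested dicts are closed, None is opaque or a leaf.
UMBRELLA_SHAPE = {
    "global": {
        "imageRegistry": {"host": None, "namespace": None},
        "clusterIngress": None,
        "clusterSecurity": {"policyEngine": None},
    },
    "secrets-manager": None,
    "lookup-service": None,
    "remote-access": None,
    "session-manager": None,
    "node-ca-injector": None,
}

print(unknown_keys({"global": {"clusterSecuirty": {"policyEngine": "Kyverno"}}}, UMBRELLA_SHAPE))
print(unknown_keys({"sesion-manager": {}}, UMBRELLA_SHAPE))
```
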

Verified: all three classes of typo (misspelled global key, misspelled
global nested field, unknown top-level key) trigger a clear schema
error at `helm template` time. All seven existing scenarios still
render cleanly.

decisions.md entry covers the split rationale.

End-to-end test for node-ca-injector: a workshop session builds a tiny
container image, pushes it to the per-session registry (HTTPS via the
wildcard cert), then creates a Deployment that pulls from that
registry. Successful rollout is the proof — without the cluster CA in
containerd's per-host trust under /etc/containerd/certs.d/, the pull
would fail with TLS verify errors and the rollout would hang.

Runner change:

- run-scenario.sh now detects `<scenario>/workshop/resources/workshop.yaml`
  and, if present, publishes the scenario-local workshop to the local
  registry stood up by `educates local cluster create` (localhost:5001)
  via `educates publish-workshop <scenario>/workshop`, then deploys it
  with `educates deploy-workshop -f <that path>`. Scenarios 01-07 keep
  the existing default-WORKSHOP_URL behaviour because they don't ship
  their own workshop directory.

Scenario 08 (`08-node-ca-injector-image-pull`):

- educates-config.yaml + pre-install.sh: identical to scenario 02
  (kind + Contour + Kyverno; pre-install materialises a wildcard TLS
  Secret + CA Secret in `educates-secrets`).
- chart-values.yaml: scenario-02 globals shape plus
  `node-ca-injector.enabled: true` at the umbrella.
- description.md: explains what the test demonstrates and what success
  looks like.
- workshop/: a complete `lab-node-ca-pull` workshop. Enables `docker`
  and `registry` session applications. Four content pages walk the user
  through writing a Dockerfile, building, tagging and pushing to
  `${REGISTRY_HOST}`, creating the Deployment, and watching
  `kubectl rollout status` succeed. The summary calls out which step is
  the actual proof of node-ca-injector working.

Verified all eight scenarios render cleanly under `helm template`;
scenario 08 emits all eight node-ca-injector resources plus the chart
pods with the ca-trust-store init container.

The runner pause at step 5/6 is the verification surface — interactive
since the proof lives in a workshop session, not in a kubectl assertion
the runner can make against the cluster directly.
…me defaults via Chart.yaml annotations

The user-facing image-registry knob is renamed to
`development.imageRegistry` (subchart-local) and
`global.development.imageRegistry` (umbrella), and the publish-time
default registry moves from a populated `values.yaml` block to
Chart.yaml annotations:

  educates.dev/image-registry-host: "ghcr.io"
  educates.dev/image-registry-namespace: "educates"

The release workflow updates these annotations per fork (one `yq -i`
call per Chart.yaml) so the chart that gets shipped points at the
right registry without a values override. Mirrors v3's
`push-installer-bundle` Makefile target which baked refs at OCI-bundle
build time, translated to a chart-publish edit step.

The runtime IMAGE_REPOSITORY semantic now matches v3's intent:

- When `development.imageRegistry` is set: emitted into the runtime
  config so workshop sessions get IMAGE_REPOSITORY={host}/{namespace}
  for `$(image_repository)` content placeholder resolution.
- When empty (normal use): runtime config's `imageRegistry` block is
  emitted empty; runtime falls back to
  `registry.default.svc.cluster.local` per
  `session-manager/handlers/operator_config.py:35`. This avoids
  silently breaking the local-dev workflow on installs that left a
  populated registry in place.

Implementation notes:

- Each subchart helper now has TWO resolvers:
    * `resolvedImageRegistry` falls back to Chart.yaml annotations and
      is consumed by chart-rendered + runtime-children image-ref
      composition.
    * `resolvedDevelopmentImageRegistry` (session-manager only — the
      subchart that owns the operator-config emission) does NOT fall
      back to annotations; returns user/global only. Emitted into the
      runtime config blob's `imageRegistry` field.
- Annotations added to all four image-rendering Chart.yamls (session-
  manager, lookup-service, secrets-manager, node-ca-injector). Helper
  reads `.Chart.Annotations["educates.dev/image-registry-..."]`.
- Subchart `values.yaml`: `imageRegistry` block dropped; replaced by
  empty `development.imageRegistry`. Schemas updated. Doc-of-record
  follows the session-manager subchart shape.
- Umbrella `values.yaml` and schema: `global.imageRegistry` →
  `global.development.imageRegistry`.
- Helper failure message updated to point at both override paths.
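
The two-resolver split can be sketched in Python as follows. The function names mirror the helper names above, but the bodies are an illustration, not the template code; in particular, falling back to annotations field by field (rather than all-or-nothing) and letting the global value take precedence over the subchart-local one are assumptions.

```python
HOST_KEY = "educates.dev/image-registry-host"
NAMESPACE_KEY = "educates.dev/image-registry-namespace"


def _development_registry(values: dict) -> dict:
    """User-supplied registry only: global first, then subchart-local."""

    def block(source):
        return ((source or {}).get("development") or {}).get("imageRegistry") or {}

    global_block = block(values.get("global"))
    local_block = block(values)
    return {
        field: global_block.get(field) or local_block.get(field) or ""
        for field in ("host", "namespace")
    }


def resolved_development_image_registry(values: dict) -> dict:
    """No annotation fallback: this is what lands in the runtime config blob."""
    return _development_registry(values)


def resolved_image_registry(values: dict, annotations: dict) -> dict:
    """Falls back to Chart.yaml annotations: used to compose image refs."""
    user = _development_registry(values)
    return {
        "host": user["host"] or annotations.get(HOST_KEY, ""),
        "namespace": user["namespace"] or annotations.get(NAMESPACE_KEY, ""),
    }


annotations = {HOST_KEY: "ghcr.io", NAMESPACE_KEY: "educates"}
print(resolved_image_registry({}, annotations))
print(resolved_development_image_registry({}))
```

Keeping the second resolver free of the annotation fallback is what lets the runtime see an empty registry in normal use and apply its own default.
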

Verified end-to-end:

- Normal mode (scenario 01 with empty development.imageRegistry):
  chart pods resolve to `ghcr.io/educates/educates-{secrets,session}-
  manager:3.7.1` from annotations; runtime config blob has
  `imageRegistry: { host: "", namespace: "" }`.
- Dev override (`global.development.imageRegistry: { host:
  localhost:5001, namespace: educates-dev }`): all 12 Educates
  imageVersions entries redirect to localhost:5001/educates-dev/
  AND the runtime config blob carries the same registry, so workshops
  with `$(image_repository)` placeholders resolve consistently.

decisions.md gets a new entry superseding the prior `imageRegistry`
decision with the development-knob framing and the rationale for the
two-resolver split.

Captures GitHub-issue drafts for runtime simplifications that should
land once the v4 chart-based install ships in develop. Format mirrors
decisions.md — one heading per issue, prose body, date added — so
entries can be transcribed to the issue tracker with minimal further
editing.

Initial entries:

- Simplify `operator_config.py` IMAGE_REPOSITORY resolution. Drop the
  `imageRegistry.host` + `imageRegistry.namespace` compose logic in
  favour of a single `imageRepository` field, and stop falling through
  in `image_reference()` for short-names not in `imageVersions` —
  treat them as config errors instead.
- Drop `clusterIngress.tlsCertificate` / `caCertificate` inline forms
  from `operator_config.py`. The chart only emits the `*Ref` forms.
- CI lint: assert Chart.yaml annotations stay in sync across all four
  image-rendering subcharts (and optionally that version / appVersion
  / dependency versions match across umbrella + subcharts).
- Document the chart release workflow's annotation update step in the
  release runbook.

Each entry has a "trigger to file" so it doesn't get filed prematurely
while v3 is still the production install path.

Tighten Phase 0 in the v4 development plan and add three decisions-log
entries covering the choices made for kubebuilder bootstrap:

- Operator at installer/operator/, kubebuilder's config/ kustomize tree
  stripped; controller-gen writes CRDs and RBAC directly into the
  educates-installer Helm chart.
- Spec types adopt the full r3 shape from Phase 0; status grows
  alongside the reconciler that produces each field. Avoids dead API
  surface drifting from r3.
- Operator image at Phase 0 is a local-dev placeholder built via
  make docker-build; publish-time annotations and release workflow are
  deferred to Phase 6.

Also narrows Phase 0 CEL scope to singleton-name + mode-immutability
(mode-field exclusivity moves to Phase 1) and Phase 0 RBAC to the four
CRDs only (referenced-resource watches move to Phase 1). CLAUDE.md gets
a new "Operator project (Phase 0+)" block listing the make targets and
conventions.

Phase 0 step 1: bare kubebuilder scaffold at installer/operator/. No
real types or reconciler logic — just the layout we'll grow.

- Multigroup project (config + platform groups under domain
  educates.dev), repo path
  github.com/educates/educates-training-platform/installer/operator,
  added to root go.work.
- Four APIs scaffolded with controller stubs: EducatesClusterConfig
  (config/v1alpha1), SecretsManager / LookupService / SessionManager
  (platform/v1alpha1).
- Per the Phase 0 layout decision, kubebuilder's config/ kustomize
  tree is stripped; controller-gen writes CRDs and RBAC into
  bin/manifests/{crd,rbac} for now and will retarget the
  educates-installer chart in step 5.
- Makefile pruned of kustomize-dependent targets (install, uninstall,
  deploy, undeploy, build-installer, kustomize tool target,
  setup-test-e2e/test-e2e, docker-buildx, docker-push) and the
  kubebuilder-default test/e2e/ tree removed. smoke-test target is
  staged with a fail-fast message until step 5 wires it.
- Operator-local .github/workflows/ removed; the monorepo CI workflow
  for the operator lands in step 6.

Verified: go build ./..., go vet ./..., make generate, and make
manifests all run clean. CRDs + RBAC YAML produced for all four kinds.

Translate the full r3 EducatesClusterConfig spec surface into Go types
under api/config/v1alpha1/. Mirrors the CRD draft revision 3 in
docs/architecture/educates-crd-draft-v1alpha1-r3.md:

- Mode (Managed | Inline), with the full Managed-mode tree:
  Infrastructure (provider + optional cloud + service-account identities),
  Ingress (domain, ingressClassName, controller, certificates),
  Certificates (BundledCertManager / ExternalCertManager / StaticCertificate),
  ACME with DNS01 solvers (Route53, CloudDNS, Cloudflare, AzureDNS),
  DNS (BundledExternalDNS / Manual / None),
  PolicyEnforcement (clusterPolicy, workshopPolicy, kyverno),
  ImageRegistry (prefix + pullSecrets).
- Inline-mode tree mirroring the same surface where applicable
  (ingress, policyEnforcement, imageRegistry).
- Shared OperationalBlock duplicated at every Bundled use site per the
  r3 design intent (no schema-ref factoring).
- Static defaults marked with +kubebuilder:default for fields the r3
  doc calls out: dns.provider=None, clusterPolicy.engine=Kyverno,
  workshopPolicy.engine=Kyverno, kyverno.provider=Bundled.
- Enum validation on every closed-set field via
  +kubebuilder:validation:Enum.

Phase 0 CEL rules added (the only two in scope; mode-field exclusivity
moves to Phase 1 per the development plan):

- Singleton-name on the wrapper type:
  self.metadata.name == 'cluster'
- Mode immutability on the spec:
  self.mode == oldSelf.mode
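
The two rules above can be sketched as controller-gen markers on Go types, mirrored by a plain-Go check. This is a minimal sketch under stated assumptions, not the operator's source: the trimmed `Config` / `Spec` types, the message strings and the `validateCreate` / `validateUpdate` helpers are hypothetical, and real enforcement happens in the apiserver through CEL rather than in Go.

```go
package main

import (
	"errors"
	"fmt"
)

// Mode is the closed set named in the commit message (Managed | Inline).
type Mode string

const (
	ModeManaged Mode = "Managed"
	ModeInline  Mode = "Inline"
)

// Spec is a hypothetical, trimmed-down stand-in for the real spec type.
// The marker shows how a transition rule is usually declared; the message
// text is an assumption.
//
// +kubebuilder:validation:XValidation:rule="self.mode == oldSelf.mode",message="mode is immutable"
type Spec struct {
	Mode Mode
}

// Config is a hypothetical stand-in for the wrapper type that carries the
// singleton-name rule.
//
// +kubebuilder:validation:XValidation:rule="self.metadata.name == 'cluster'",message="name must be 'cluster'"
type Config struct {
	Name string
	Spec Spec
}

// validateCreate mirrors the singleton-name rule in plain Go.
func validateCreate(c Config) error {
	if c.Name != "cluster" {
		return errors.New("name must be 'cluster'")
	}
	return nil
}

// validateUpdate mirrors the mode-immutability transition rule: it compares
// the stored object (oldSelf) with the incoming one (self).
func validateUpdate(oldC, newC Config) error {
	if err := validateCreate(newC); err != nil {
		return err
	}
	if oldC.Spec.Mode != newC.Spec.Mode {
		return errors.New("mode is immutable")
	}
	return nil
}

func main() {
	stored := Config{Name: "cluster", Spec: Spec{Mode: ModeManaged}}
	changed := Config{Name: "cluster", Spec: Spec{Mode: ModeInline}}
	fmt.Println(validateCreate(Config{Name: "other"}))
	fmt.Println(validateUpdate(stored, changed))
	fmt.Println(validateUpdate(stored, stored))
}
```

The Go mirror is only useful for reasoning about the rules; the envtest specs described later exercise the actual CEL.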

Status surface is intentionally minimal (observedGeneration, phase,
conditions) per the "status grows alongside reconcilers" decision.

CRD shape is now Cluster-scoped with shortName ecc and Mode/Phase/Age
printer columns. controller-gen output verified: scope: Cluster, both
CEL rules present, four defaults populated, three printer columns,
~1.2k lines of well-formed YAML.

go build, go vet, make generate, make manifests all pass.
…rom r3

Translate the three platform-group CRDs from r3 into Go types under
api/platform/v1alpha1/. Mirrors the CRD draft revision 3:

- SecretsManager: image override, logLevel (default info), resources.
  No replicas knob (singleton at the pod level upstream). Image-pull
  credentials inherit from EducatesClusterConfig.status.imageRegistry.
- LookupService: ingress (prefix + optional tlsSecretRef override),
  image, logLevel (default info), resources. Component-specific knobs
  (auth, rate-limiting, storage) deferred until the lookup-service
  owner specifies them.
- SessionManager: ingressOverrides, workshopPolicyOverride, images
  (overrides only — registry prefix + pullSecrets inherit from
  EducatesClusterConfig.status), themes (ConfigMap/Secret/URL source
  type), defaultTheme, tracking (Google Analytics, Amplitude, Clarity,
  webhook), defaultAccessCredentials, sessionCookieDomain,
  allowedEmbeddingHosts, storage, network (packetSize, blockedCidrs),
  imageCache (default disabled), registryMirrors, logLevel.

Shared types in common_types.go: LogLevel, ComponentPhase,
LocalObjectReference, ImageRef. WorkshopPolicyEngine is duplicated in
sessionmanager_types.go to avoid coupling the platform package to the
config API group.

All three CRDs Cluster-scoped with singleton-name CEL
(self.metadata.name == 'cluster') and Phase/Age printer columns.
Status surface intentionally minimal (observedGeneration, phase,
conditions) per the Phase 0 status policy in decisions.md.

go build, go vet, make generate, make manifests all pass. CRDs render
clean (lookup ~257, secrets ~234, session ~431 lines).
…ines

Phase 0 step 5: stand up the educates-installer Helm chart and wire the
four trivial reconcilers. The chart is now the canonical artefact for
the v4 installer; controller-gen targets it directly per the Phase 0
layout decision.

Chart at installer/charts/educates-installer/:

- Chart.yaml apiVersion v2, kubeVersion >=1.31.0-0, version and
  appVersion locked at 4.0.0-alpha.1 (matches the runtime chart's
  versioning approach but tracks operator releases independently).
- crds/: the four CRDs from controller-gen, in Helm's reserved
  location — installed once on first helm install, not templated, not
  deleted on uninstall (mirrors the runtime chart's CRD-shipping
  decision).
- templates/rbac/role.yaml: ClusterRole "educates-installer-manager"
  generated by controller-gen. Phase 0 RBAC scope is exactly the four
  CRDs and their /status + /finalizers — no Secrets/ClusterIssuers/
  IngressClasses watches yet (those land in Phase 1 with the Inline
  validator).
- templates/rbac/role-binding.yaml, serviceaccount.yaml,
  deployment.yaml: hand-written, Helm-templated. Deployment runs the
  manager binary with --health-probe-bind-address=:8081, metrics off
  by default, leader election off (single replica).
- values.yaml: image as repository + tag (dev placeholder),
  imagePullSecrets, resources, nodeSelector, tolerations, affinity,
  leaderElection.enabled. Comment block in values.yaml documents the
  Phase 0 local-dev workflow (make docker-build + kind load + helm
  install).
- NOTES.txt: post-install message naming the four CRDs, calling out
  the Phase 0 stub-only state, and listing useful kubectl commands.

Operator changes:

- All four Reconcile() bodies emit a single "Reconciling X" log line
  with the request name and return — gives the smoke test something
  to grep for. The kubebuilder TODO scaffolding is removed and replaced
  with a Phase-pointing doc comment.
- Makefile manifests target now writes to ../charts/educates-installer/
  {crds,templates/rbac}/ instead of bin/manifests/. Role name set to
  "educates-installer-manager" to match the chart's hand-written
  ClusterRoleBinding.

Verified: go build/vet/generate clean, helm lint passes (only the
benign "icon is recommended" info), helm template renders all four
expected resources (ServiceAccount, ClusterRole, ClusterRoleBinding,
Deployment).
Phase 0 step 6 / final: replace the kubebuilder-scaffolded reconciler
tests with Phase 0 CEL validation specs, wire envtest to load CRDs from
the chart, add a local kind-based smoke test, and add a repo-root CI
workflow.

CRD validation tests (envtest, ginkgo):

- EducatesClusterConfig (config group): three specs — valid Managed-mode
  CR named "cluster" is accepted; CR with name != "cluster" is rejected
  by the singleton CEL; spec.mode change on update is rejected by the
  mode-immutability CEL.
- Platform group: one shared file, one Describe per CRD. Each verifies
  singleton-name acceptance and rejection.
- Both suite_test.go files now point at
  installer/charts/educates-installer/crds/ instead of the no-longer-
  present kubebuilder default config/crd/bases.
- The kubebuilder-scaffolded reconciler tests (one per kind, with TODO
  placeholders and incompatible "test-resource" naming) are removed.

Smoke test (hack/smoke-test.sh, local-only):

- Creates kind cluster on demand, builds the operator image with
  make docker-build, kind-loads it, helm-installs the
  educates-installer chart, applies a minimal EducatesClusterConfig,
  and asserts the "Reconciling EducatesClusterConfig" log line appears
  within 60s. Tears down on exit unless KEEP_CLUSTER=true.
- Wired into the previously-stubbed make smoke-test target.

CI (.github/workflows/installer-operator-ci.yaml):

- Triggers on changes to installer/operator/, the chart,
  go.work/go.work.sum, or the workflow itself.
- Steps: go vet, go build, manifests-drift check, generate-drift
  check, make test (envtest), make lint.
- Uses go-version-file pointing at the operator's go.mod so CI tracks
  whatever the project declares.

Go-version pin lowered:

- go.work and operator go.mod were both bumped to 1.25.7 by kubebuilder
  init. That triggered a "compile: version go1.25.7 does not match go
  tool version go1.25.6" warning chain under bash -e, which cascaded
  through the test recipe even though tests themselves passed. Lowered
  both to 1.25.0 — works under any 1.25.x toolchain and keeps the
  workspace consistent with the existing client-programs and node-ca-
  injector modules.

Operator README:

- Replaces the kubebuilder TODO-stub README with a tight summary of
  layout, the architecture docs, and the make targets.
Phase 1 step 1: extend the EducatesClusterConfig API surface with the
two structural rules deferred from Phase 0 and the inter-CR contract
fields component reconcilers will read.

CEL exclusivity (two new rules on EducatesClusterConfigSpec):

- When mode is Inline, the Managed-mode top-level fields
  (infrastructure, ingress, dns, policyEnforcement, imageRegistry) are
  forbidden.
- When mode is Managed, spec.inline is forbidden.
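
The exclusivity rules above amount to a small decision table. The following is a plain-Go sketch of that logic, assuming a hypothetical flattened `spec` struct in which each boolean records whether the corresponding block is set; the real rules are CEL expressions on EducatesClusterConfigSpec, and the error wording here is illustrative.

```go
package main

import "fmt"

// spec is a hypothetical, flattened view of the spec: each bool records
// whether the corresponding block is present on the object.
type spec struct {
	Mode              string
	Infrastructure    bool
	Ingress           bool
	DNS               bool
	PolicyEnforcement bool
	ImageRegistry     bool
	Inline            bool
}

// checkExclusivity rejects Managed-mode top-level fields when mode is
// Inline, and spec.inline when mode is Managed.
func checkExclusivity(s spec) error {
	switch s.Mode {
	case "Inline":
		managedOnly := []struct {
			name string
			set  bool
		}{
			{"infrastructure", s.Infrastructure},
			{"ingress", s.Ingress},
			{"dns", s.DNS},
			{"policyEnforcement", s.PolicyEnforcement},
			{"imageRegistry", s.ImageRegistry},
		}
		for _, f := range managedOnly {
			if f.set {
				return fmt.Errorf("spec.%s is forbidden when mode is Inline", f.name)
			}
		}
	case "Managed":
		if s.Inline {
			return fmt.Errorf("spec.inline is forbidden when mode is Managed")
		}
	}
	return nil
}

func main() {
	fmt.Println(checkExclusivity(spec{Mode: "Inline", Inline: true}))
	fmt.Println(checkExclusivity(spec{Mode: "Inline", Inline: true, DNS: true}))
	fmt.Println(checkExclusivity(spec{Mode: "Managed", Ingress: true, Inline: true}))
}
```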

Combined with the existing mode-immutability rule, the spec now carries
three CEL invariants. All three are envtest-verified.

Status surface (the inter-CR contract):

- status.mode echoes spec.mode at the time of last successful reconcile
  so components can branch without reading spec.
- status.ingress (StatusIngress): domain, ingressClassName,
  wildcardCertificateSecretRef (NamespacedSecretRef — namespace +
  name), optional caCertificateSecretRef, optional clusterIssuerRef.
- status.policyEnforcement (StatusPolicyEnforcement): clusterPolicyEngine,
  workshopPolicyEngine.
- status.imageRegistry (reuses spec ImageRegistry shape): prefix and
  pullSecrets, populated even when empty so components see a single
  source of truth.

Status fields are populated by the Phase 1 reconciler in the next step.
The Managed-mode-only fields (bundledChartVersions) and conditions
(InfrastructureConfigured, IngressReady, CertificatesReady, DNSReady,
PolicyEnforcementReady) remain deferred to Phase 2/3 alongside their
producing reconcilers.

go build, go vet, make generate/manifests, and make test all pass; the
generated CRD reflects all three CEL rules.
Phase 1 step 2: thread the operator's own namespace from the chart
through to the reconciler, and restrict the Secret cache to that
namespace.

- Chart Deployment template: inject OPERATOR_NAMESPACE via the
  downward API (fieldRef metadata.namespace).
- cmd/main.go: read OPERATOR_NAMESPACE at startup; fail fast if unset
  so a misconfigured Deployment doesn't silently misbehave.
- Manager cache.Options.ByObject restricts &corev1.Secret{} reads to
  the operator namespace only — user-supplied Secrets referenced from
  spec.inline live there, and the operator has no need to cache
  Secrets cluster-wide. ClusterIssuers (cluster-scoped) and
  IngressClasses (cluster-scoped) keep cluster-wide cache, as they
  must.
- EducatesClusterConfigReconciler gains an OperatorNamespace string
  field; main.go threads the value in. The Phase 0 stub doesn't read
  it yet, but the wiring is now in place for the Inline validator
  (steps 4-5).
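
The fail-fast read of OPERATOR_NAMESPACE can be sketched with the standard library alone. This is an illustration, not the code in cmd/main.go: the `operatorNamespace` helper and its injectable lookup parameter are assumptions made so the behaviour can be exercised without a cluster, and the cache restriction itself (cache.Options.ByObject) is not shown.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// operatorNamespaceEnv is the variable the chart injects through the
// downward API (fieldRef metadata.namespace).
const operatorNamespaceEnv = "OPERATOR_NAMESPACE"

// operatorNamespace returns the namespace or an error when the variable is
// unset or empty. The lookup function is injectable so the check is easy to
// test; in production it would be os.LookupEnv.
func operatorNamespace(lookup func(string) (string, bool)) (string, error) {
	ns, ok := lookup(operatorNamespaceEnv)
	if !ok || ns == "" {
		return "", errors.New(operatorNamespaceEnv + " must be set; check the Deployment's downward API env entry")
	}
	return ns, nil
}

func main() {
	ns, err := operatorNamespace(os.LookupEnv)
	if err != nil {
		// A real manager would exit non-zero here; the sketch only reports.
		fmt.Println("startup would abort:", err)
		return
	}
	fmt.Println("operator namespace:", ns)
}
```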

go build, go vet, make test all pass; helm lint clean; helm template
shows the new env var injection in the rendered Deployment.
Phase 1 step 3: extend the operator's ClusterRole with read-only access
to the resources Inline-mode validation needs to look up.

- Secrets (core): for the wildcard TLS Secret, optional CA Secret, and
  imageRegistry pullSecrets. Reads are cache-restricted to the
  operator namespace by the Phase 1 step 2 cache.Options.
- ClusterIssuers (cert-manager.io): when spec.inline.ingress.
  clusterIssuerRef is set, the validator checks existence and the
  Ready condition.
- IngressClasses (networking.k8s.io): the validator checks the
  referenced IngressClass exists.

All three are get/list/watch only — Inline mode never modifies cluster
state. Markers added on the EducatesClusterConfigReconciler so
controller-gen produces them; role.yaml regenerated into the chart.
Phase 1 step 4: implement the EducatesClusterConfig Inline-mode
reconciler. The operator now validates referenced cluster state
and publishes the inter-CR status contract that Phase 4 components
will consume.

Validator (validator.go):

- checkIngressClass: cluster-scoped Get against networkingv1.IngressClass.
- checkWildcardSecret: Get in operator namespace; assert tls.crt and
  tls.key keys present.
- checkCASecret: optional; Get in operator namespace; assert ca.crt key
  present.
- checkClusterIssuer: optional; unstructured.Unstructured against
  cert-manager.io/v1 ClusterIssuer; assert status.conditions[Ready]==True.
  IsNoMatchError (cert-manager CRD absent) is surfaced as a validation
  error rather than a reconcile retry — matches how a user would
  experience it. Vendored cert-manager Go types are deferred to Phase 2
  per the recorded decision.
- validationError type carries spec.<field> path + reason so condition
  messages name the offending input.
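
The key-presence check and the field-carrying error can be sketched as follows. This is a hedged illustration, not validator.go: the `checkTLSKeys` helper operates on a plain map instead of a fetched Secret, the struct layout of `validationError` is assumed, and the field path used in `main` is a guess at how the spec names the wildcard Secret reference.

```go
package main

import "fmt"

// validationError carries the spec path of the offending input plus a
// human-readable reason, so a condition message can name the field.
type validationError struct {
	Field  string
	Reason string
}

func (e *validationError) Error() string {
	return e.Field + ": " + e.Reason
}

// checkTLSKeys asserts that both tls.crt and tls.key are present in the
// Secret data. Fetching the Secret from the operator namespace is omitted.
func checkTLSKeys(field string, data map[string][]byte) error {
	for _, key := range []string{"tls.crt", "tls.key"} {
		if _, ok := data[key]; !ok {
			return &validationError{
				Field:  field,
				Reason: fmt.Sprintf("Secret is missing key %q", key),
			}
		}
	}
	return nil
}

func main() {
	// Hypothetical field path, used only for illustration.
	field := "spec.inline.ingress.wildcardCertificateSecretRef"
	fmt.Println(checkTLSKeys(field, map[string][]byte{"tls.key": []byte("k")}))
	fmt.Println(checkTLSKeys(field, map[string][]byte{
		"tls.crt": []byte("c"),
		"tls.key": []byte("k"),
	}))
}
```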

Reconcile flow (educatesclusterconfig_controller.go):

- Add finalizer "educatesclusterconfig.config.educates.dev/finalizer"
  on first sight; first pass returns Requeue so the next pass sees a
  stable resource version. Phase 1 deletion handler is a no-op; Phase
  2 Managed-mode will uninstall charts here in reverse install order.
- For Inline mode: run validator → on success, populate
  status.{mode,ingress,policyEnforcement,imageRegistry} and set
  Ready=True / ValidationSucceeded=True. On failure: set Phase Degraded
  and both conditions to False with the field-specific message.
  status.imageRegistry is always populated (empty struct when unset)
  so components see a single source of truth.
- For Managed mode: no-op stub until Phase 2.
- Defensive guard: if mode==Inline and spec.inline is nil (CEL bypass),
  Degraded with "spec.inline required".

Envtest specs (validator_test.go):

- All-refs-valid → Ready, finalizer set, full status contract populated.
- Wildcard Secret missing → Degraded, message names the field +
  "not found".
- Wildcard Secret present without tls.crt → Degraded.
- IngressClass missing → Degraded.
- Optional CA Secret referenced but missing → Degraded.
- Delete clears the finalizer and the apiserver removes the object.

Test helpers: makeWildcardSecret uses Opaque type because
kubernetes.io/tls Secrets are apiserver-validated to require both
tls.crt + tls.key, which would block the missing-key test.

go build / vet / test all pass; config-package coverage 63.2% (the
gap is mostly the unreachable Managed-mode branch in Reconcile, which
becomes covered once Phase 2 implements it).
configuration-settings documents externalTLSTermination on the three
kinds and the ingressOverrides.protocol field (also corrects the
SessionManager not-yet-supported list: Secret-sourced themes and
imagePrePuller are wired; defaultAccessCredentials, registryMirrors
and non-Secret theme sources are rejected). secure-http-connections
replaces the standalone-chart workaround with the supported override,
noting certificate settings are still required. Migration guide maps
clusterIngress.protocol accordingly. decisions.md records the
placement rationale; the External-load-balancer follow-up is marked
partially landed with the remaining scope.
Verifying the operational.replicas plumbing against the vendored
upstream charts showed the shared shape didn't hold: external-dns
1.21.1 hardcodes replicas to 1 and silently swallows the bogus
replicaCount value; fanning one count across Kyverno's four
controllers conflicts with upstream HA guidance (3+ for the admission
controller only); and cert-manager never consumed the block.

Remove operational from bundledCertManager and bundledExternalDNS,
drop the kyverno.bundled wrapper (it only carried operational), and
delete the corresponding renderer plumbing. BundledContour keeps the
block; its replicas knob maps to contour.replicaCount as before.

Regenerate CRDs, deepcopy, the CLI-embedded chart copy, and the
CRD-derived EducatesConfig schema. Amend the r3 draft (dated note in
the operational-block pattern section, open item 4 now tracks
per-service shapes) and add a decisions-log entry.
Five new samples populating every supported v1alpha1 spec field:
05-managed-full (Managed-mode kitchen sink, GKE-flavoured, with the
reserved-but-rejected surface listed in the header), 06-inline-full
(generic BYO with every inline field including the cross-namespace CA
ref and clusterIssuerRef), and -full variants of the three platform
CRs (sessionmanager-full keeps the rejected defaultAccessCredentials
and registryMirrors blocks as comments). README gains both table
sections and a note that -full files are field references, not
starting points.

All five strict-decode against the operator's typed API.
gke-full / eks-full / inline-full populate every field of their JSON
schema, with toggles set opposite their defaults and explicit service
accounts / role ARNs so the new round-trip tests also prove that
WithDefaults doesn't overwrite explicit values. Mirrors the existing
local-full.yaml + TestLoad_FullLocalConfig_RoundTripsAllFields pair.
Every subchart enabled (lookup-service and remote-access for the
first time in any scenario) and every session-manager value block
populated. pre-install stages the TLS/CA pair (scenario-02 logic)
plus two theme Secrets and a dummy pull Secret; post-deploy asserts
in four groups: typed-values serialisation into the educates-config
blob, SecretCopier plumbing, optional-subchart rollouts, and the
external defaultTheme served by the portal.

The chart-values header documents the per-pod operational knobs
(image, imagePullSecrets, resources, development.imageRegistry)
deliberately left at defaults and where to set them. tests/README
gains the missing 07/08 rows alongside the new 09 entry.
New conceptual companion to the CLI/Helm how-to guides: the three
installation object layers (CLI config kinds, operator custom
resources, Helm charts) and their relations, the five-stage value
pipeline with an inspection command for every intermediate (render,
helm get values per release, the ECC status contract, the
educates-config Secret), a worked example tracing the ingress domain
end to end, and an entry-point table for the CLI, GitOps, and
standalone-Helm paths.
# Conflicts:
#	developer-docs/release-procedures.md
#	node-ca-injector/Dockerfile
#	vendir.lock.yml
#	vendir.yml
The imgpkg bundle duplicated what other artifacts already provide:
GitHub releases carry the binaries for all four platforms, and the
educates-cli image covers container use. Nothing in the repo consumed
the bundle, and the push-client-programs Makefile target had a
copy-paste bug (built linux twice, never darwin) that nobody noticed.

Removes the publish-client-programs CI job, the push-client-programs
Makefile target, and rewrites the docs to point at GitHub releases,
workflow-run artifacts, and the educates-cli image instead.
The local CLI config snippets in the getting-started docs still used
the v3 key names (localKindCluster, localDNSResolver, podCIDR/
serviceCIDR); the EducatesLocalConfig schema names these cluster,
resolver and podSubnet/serviceSubnet.

build-instructions.md still described the v3 source-deploy flow
entirely: the developer-testing values file, the deploy-platform /
push-installer-bundle / deploy-platform-app make targets (removed
with the Carvel installer) and the create-cluster --version flag.
Rewritten around the v4 flow: educates local config init/edit,
educates admin platform deploy --local-config, imageVersions
overrides for locally built images, and the operator docker-build +
kind load loop.
Fills in the New Features placeholder with the operator-based install
pipeline (educates-installer chart, four CRDs, Managed/Inline modes,
bundled cluster services, ACME DNS01), the standalone runtime chart,
the kind-based CLI configuration with published JSON schemas, the new
admin platform / local cluster / local config commands, local-CA TLS,
externalTLSTermination, and the OCI chart + image-list publishing.

Prepends the v3-removal entries to Features Changed: Carvel installer
gone with no in-place upgrade, values.yaml to config.yaml migration
and key renames, no certificate-less installs, imageCache renamed to
imagePrePuller, and the educates-client-programs bundle removal.
Every user-facing change on this branch must update
project-docs/release-notes/version-4.0.0.md before it counts as
complete; internal-only changes are exempt.
The generated local CA was only reachable as base64 inside the cached
Secret YAML, so there was no reasonable way to tell users how to make
their browser trust workshop URLs.

'educates local secrets export NAME --pem' prints the certificate of a
cached TLS/CA secret as PEM (never the key), and 'local secrets add ca'
now prints the export command and a docs pointer after generating a CA.
The quick start gains a 'Trusting the workshop certificates' section
with per-platform trust-store import instructions, and the 4.0.0
release notes cover the new flag.
Completes the chart bump started in the working tree (Makefile +
SHA256SUMS + downloaded tarballs): embed.go now embeds contour-0.6.0
(appVersion 1.33.5) and kyverno-3.8.1 (appVersion v1.18.1), and the
old tarballs are removed. Workload-name readiness gates re-verified
by rendering both new charts: contour still produces contour-contour /
contour-envoy with no webhooks, kyverno still produces the same four
controller Deployments.

The bundled kyverno policies in the session-manager chart move from
kyverno/policies release-1.15 to release-1.18 to match: four
cluster-policies pick up modernised CEL (variables blocks, optional
chaining), the workshop-policies are unchanged, and policy names are
stable so per-workshop cloning is unaffected. The session-manager
subchart tarball is repackaged so the operator installs the refreshed
policies.

A new directory-consistency test ties the Makefile VENDORED_CHARTS
list, SHA256SUMS, embed.go and the tarballs on disk together, so a
half-finished bump or stale tarball now fails 'make test' instead of
silently shipping the old chart. Both vendoring READMEs now document
the full upgrade workflow, including the kyverno-policies follow-up.
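
The idea behind the directory-consistency test can be sketched as a set comparison across the four places a chart version is recorded. This is an assumption-laden illustration: the `inconsistencies` helper and its input shape are hypothetical, and parsing the Makefile, SHA256SUMS, embed.go and the tarball directory is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// inconsistencies takes, per source, the chart names that source lists and
// reports every chart that is present in at least one source but missing
// from another. An empty result means all sources agree.
func inconsistencies(sources map[string][]string) []string {
	union := map[string]bool{}
	for _, charts := range sources {
		for _, c := range charts {
			union[c] = true
		}
	}
	var problems []string
	for source, charts := range sources {
		have := map[string]bool{}
		for _, c := range charts {
			have[c] = true
		}
		for c := range union {
			if !have[c] {
				problems = append(problems, fmt.Sprintf("%s is missing from %s", c, source))
			}
		}
	}
	sort.Strings(problems) // stable output regardless of map iteration order
	return problems
}

func main() {
	// A half-finished bump: the kyverno tarball was never downloaded.
	fmt.Println(inconsistencies(map[string][]string{
		"VENDORED_CHARTS": {"contour-0.6.0", "kyverno-3.8.1"},
		"SHA256SUMS":      {"contour-0.6.0", "kyverno-3.8.1"},
		"embed.go":        {"contour-0.6.0", "kyverno-3.8.1"},
		"tarballs":        {"contour-0.6.0"},
	}))
}
```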
…ge overrides

The ImageOverride doc claimed any image could be overridden by short
name, but three names silently did nothing: the session-manager
chart-pod image, the pre-puller pause image and node-ca-injector all
live outside the chart's imageVersions inventory. applySMImageValues
now routes those three to the chart values that actually control them
(image, imagePrePuller.pauseImage, and the node-ca-injector subchart's
image via renderNodeCAInjectorValues); everything else flows through
the inventory as before. The imagePrePuller enabled-toggle writer and
the pauseImage router compose into one map instead of clobbering.

Groundwork for the local build flow, where a dev-built CLI defaults
every core image to the local registry.
…registry

A CLI whose compiled-in version is not a semver (i.e. not stamped by
the release pipeline — 'latest' from make, 'develop' fallback) now
defaults an imageVersions entry for every core platform image to
<imageRepository>/educates-<name>:<version>, and pins the operator
pullPolicy to Always so rebuilt dev tags re-pull. Release binaries are
semver-stamped and skip all of this; user-supplied entries always win.
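
The defaulting rule can be sketched as below. This is a minimal sketch, not the CLI's implementation: the `defaultImageVersions` helper, the loose semver regular expression and the image names passed in `main` are assumptions; only the `<imageRepository>/educates-<name>:<version>` shape and the "user-supplied entries always win" rule come from the description above.

```go
package main

import (
	"fmt"
	"regexp"
)

// semverRe is a deliberately loose check: MAJOR.MINOR.PATCH with optional
// pre-release and build suffixes. The real CLI may use a stricter parser.
var semverRe = regexp.MustCompile(`^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$`)

// defaultImageVersions returns the user's entries, plus a default entry for
// every core image when the compiled-in version is not a semver. Existing
// user entries are never overwritten.
func defaultImageVersions(version, repository string, coreImages []string, user map[string]string) map[string]string {
	out := make(map[string]string, len(user)+len(coreImages))
	for name, ref := range user {
		out[name] = ref
	}
	if semverRe.MatchString(version) {
		return out // release binary: no defaulting
	}
	for _, name := range coreImages {
		if _, ok := out[name]; !ok {
			out[name] = fmt.Sprintf("%s/educates-%s:%s", repository, name, version)
		}
	}
	return out
}

func main() {
	core := []string{"session-manager", "secrets-manager"} // illustrative subset
	user := map[string]string{"session-manager": "registry.example.test/custom:1"}
	fmt.Println(defaultImageVersions("latest", "localhost:5001", core, user))
	fmt.Println(defaultImageVersions("4.0.0-alpha.1", "localhost:5001", core, nil))
}
```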

With the root Makefile building the same image set into
localhost:5001, this is what makes 'make' + 'educates local cluster
create' deploy the locally built system with zero manual config.
imageVersions entries named secrets-manager and lookup-service were
silently appended to SessionManager.spec.images.overrides, where the
chart's inventory ignores them — those components have their own CRs
with ImageRef-shaped spec.image fields the reconcilers already
consume. The translator now routes those two names there (split into
repository + tag) and excludes them from the SessionManager overrides,
uniformly across the Local, Inline, GKE and EKS kinds.
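
The routing and the repository/tag split can be sketched as follows. This is an illustration under assumptions, not the translator's code: `routeImageVersions`, `splitImageRef` and the `imageRef` struct are hypothetical names, the `training-portal` entry is only an example of a name that stays in the SessionManager overrides, and digest references (`@sha256:...`) are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// imageRef is a hypothetical repository + tag pair.
type imageRef struct {
	Repository string
	Tag        string
}

// splitImageRef splits on the last colon, but only when it follows the last
// slash, so a registry port such as localhost:5001 is not taken for a tag.
func splitImageRef(ref string) imageRef {
	lastSlash := strings.LastIndex(ref, "/")
	lastColon := strings.LastIndex(ref, ":")
	if lastColon > lastSlash {
		return imageRef{Repository: ref[:lastColon], Tag: ref[lastColon+1:]}
	}
	return imageRef{Repository: ref}
}

// componentCRs lists the names that have their own CR with a spec.image field.
var componentCRs = map[string]bool{
	"secrets-manager": true,
	"lookup-service":  true,
}

// routeImageVersions sends the two component names to their own CRs and
// leaves everything else as a SessionManager override.
func routeImageVersions(entries map[string]string) (map[string]imageRef, map[string]string) {
	components := map[string]imageRef{}
	overrides := map[string]string{}
	for name, ref := range entries {
		if componentCRs[name] {
			components[name] = splitImageRef(ref)
			continue
		}
		overrides[name] = ref
	}
	return components, overrides
}

func main() {
	components, overrides := routeImageVersions(map[string]string{
		"secrets-manager": "localhost:5001/educates-secrets-manager:latest",
		"training-portal": "localhost:5001/educates-training-portal:latest",
	})
	fmt.Println(components)
	fmt.Println(overrides)
}
```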
Plain 'make' now produces a complete locally-testable system: the
educates CLI built with dev ldflags (projectVersion=latest,
imageRepository=localhost:5001), the always-on local registry
auto-deployed via that CLI, all core platform images and the operator
image pushed to localhost:5001 — followed by a printed
'educates local cluster create' next step.

Mechanics, per the build-harmonization direction in the integration
plan (the cheap slice; component Makefiles and CI repointing stay
deferred):

- One image-% pattern rule replaces the 17 near-identical docker build
  recipes; per-image context dirs and build args via IMAGE_DIR.<name> /
  IMAGE_BUILD_ARGS.<name>. The operator image is now buildable from the
  root (image-operator), preceded by refresh-operator-embeds so the
  embedded subchart tarballs are always fresh (with a content-aware
  restore so helm's non-reproducible gzip doesn't dirty the tree).
- Images build for the current host architecture ONLY unless
  TARGET_PLATFORMS is set explicitly — no more silent QEMU-emulated
  multiarch on push builds. The CLI builds for the host platform.
- build-cli stamps the dev ldflags (overridable via CLI_VERSION /
  CLI_IMAGE_REPOSITORY) and refreshes the CLI-embedded chart + schemas
  first. build-client-programs kept as an alias.
- Old build-<image> targets are dropped in favor of image-<name>;
  verify-installer-chart / verify-cli-schemas / embed-installer-chart /
  generate-cli-schemas keep their recipes (CI contract). The 130-line
  drifted header is replaced by a short one + 'make help'.

Verified: release-stamped CLI renders zero localhost references against
a clean config; dev CLI renders all 12; verify targets green.
build-instructions.md now leads with 'make' + 'educates local cluster
create' as the complete from-source loop, documents the
TARGET_PLATFORMS host-arch-only default and the other knobs, the
image-<name> targets, the dev-vs-release CLI behavior, and the
workshop-image opt-in path. The 4.0.0 release notes gain the
imageVersions routing + dev-CLI defaulting entry.
go vet flags the unbuffered os.Signal channel passed to signal.Notify
(a signal arriving before the goroutine is ready would be dropped).
Came in from develop via the back-merge; one-line fix keeps
client-programs CI vet green.
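
The shape of that fix is a buffered channel. The sketch below is a generic illustration of the pattern go vet asks for, not the patched client-programs code; the `newSignalChannel` helper and the choice of signals are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// newSignalChannel registers for interrupt/terminate signals on a channel
// with a buffer of one. signal.Notify never blocks when sending, so with an
// unbuffered channel a signal that arrives before the receiver is ready
// would be dropped.
func newSignalChannel() chan os.Signal {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, os.Interrupt, syscall.SIGTERM)
	return ch
}

func main() {
	ch := newSignalChannel()
	defer signal.Stop(ch)
	fmt.Println("signal channel buffer size:", cap(ch))
}
```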
…nstaller

# Conflicts:
#	client-programs/Dockerfile
…e workshop ClusterPolicy

Kyverno 1.18 introduces the ValidatingPolicy type (policies.kyverno.io,
ValidatingAdmissionPolicy-shaped), the recommended successor to the CEL
ClusterPolicy form and the fix for the CEL admission warnings the old set
emitted. The bundled policy set now ships as ValidatingPolicy:

- cluster-policies (PSS baseline + restricted) re-vendored from
  kyverno/policies@release-1.18 pod-security-vpol, rendered cluster-wide as
  ValidatingPolicy (Audit). Drops the upstream-curated host-ports-range
  alternate (Audit-only, no enforcement change).
- workshop-policies: 7 re-vendored from best-practices-vpol / other-vpol; the
  3 nginx-ingress policies and the Educates-internal require-ingress-session-name
  have no upstream -vpol variant and are hand-ported to ValidatingPolicy CEL
  (require-ingress-session-name moves off the legacy JMESPath apiCall to read the
  session name from namespaceObject, fail-closed when the label is absent).

session-manager/handlers/kyverno_rules.py is rewritten to scope each policy to a
workshop environment's session namespaces. New Policy types get a per-env copy
with spec.matchConstraints.namespaceSelector injected and validationActions from
the workshop action (Enforce->Deny, Audit->Audit). Legacy ClusterPolicy supplied
via workshopSecurity.additionalKyvernoPolicies is still scoped (merged as before)
but logs a deprecation warning, tracking Kyverno's ClusterPolicy removal in 1.20.
Unrecognised kinds are skipped with a warning. Core logic factored out and
covered by a new pytest.

session-manager gains an RBAC grant for validatingpolicies.policies.kyverno.io
(the Kyverno-shipped admin:policies role covers only kyverno.io); the legacy
binding is kept for the deprecated path. Release notes document the change and
the ClusterPolicy deprecation.
On kind the cluster maps host 80/443 to the node's 80/443, so Envoy must
bind those node ports via hostPort to receive ingress traffic. The v4
Contour reconciler only set envoy.service.type and dropped v3's
"useHostPorts: true" for kind, so with a ClusterIP/LoadBalancer-pending
Envoy nothing listened on the node's 443 and TLS connections to workshop
and portal URLs were reset during the handshake (the cert and Envoy
itself were fine).

renderContourValues now enables envoy.useHostPort.{http,https} whenever
envoyServiceType is ClusterIP — a ClusterIP ingress controller is
non-functional without it, and this mirrors v3's kind topology
(ClusterIP service + host ports). The EducatesLocalConfig translator now
selects envoyServiceType: ClusterIP for the kind-based local install
(it previously left it unset, defaulting to LoadBalancer, which never
gets an address on kind). Cloud installs keep LoadBalancer/NodePort and
get no hostPort. Unit tests cover both the operator value mapping and
the translator invariant.
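
The value mapping can be sketched as a small function over a Helm-values map. This is a hedged illustration rather than the operator's renderContourValues: the `contourValuesSketch` name and signature are hypothetical, and only the keys named above (envoy.service.type and envoy.useHostPort.{http,https}) are modelled.

```go
package main

import "fmt"

// contourValuesSketch builds the envoy portion of the chart values. Host
// ports are enabled only for ClusterIP, because a ClusterIP Envoy is
// otherwise unreachable from outside the node.
func contourValuesSketch(envoyServiceType string) map[string]any {
	envoy := map[string]any{}
	if envoyServiceType != "" {
		envoy["service"] = map[string]any{"type": envoyServiceType}
	}
	if envoyServiceType == "ClusterIP" {
		envoy["useHostPort"] = map[string]any{"http": true, "https": true}
	}
	return map[string]any{"envoy": envoy}
}

func main() {
	fmt.Println(contourValuesSketch("ClusterIP"))    // kind-based local install
	fmt.Println(contourValuesSketch("LoadBalancer")) // cloud install, no hostPort
}
```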
…nstaller

# Conflicts:
#	assets-server/Dockerfile
Encodes how to update the third-party frontend libraries vendored into
the workshop dashboard theme (Bootstrap, Font Awesome, jQuery,
Underscore.js, JSONForm, js-yaml): upstream sources and versioned URLs,
the zip-subset vs single-file download shapes, target paths, and
verification. Also documents keeping each library aligned with the
matching npm dependency in the renderer and gateway package.json files,
which bundle the same libraries via esbuild.
…ironment)

Both Go builder stages lacked the --platform=$BUILDPLATFORM prefix, so
multiarch image builds ran the Go toolchain under QEMU for the
non-native architecture instead of cross-compiling.

- installer/operator/Dockerfile already had the cross-compile machinery
  (ARG TARGETOS/TARGETARCH, CGO_ENABLED=0 GOOS/GOARCH go build); only the
  --platform prefix was missing. Added it (matches node-ca-injector /
  client-programs / assets-server).
- workshop-images/base-environment/Dockerfile built git-serve natively
  (no GOOS/GOARCH); pinned the builder to BUILDPLATFORM and added the
  TARGETOS/TARGETARCH + CGO_ENABLED=0 GOOS/GOARCH cross-compile. git-serve
  v0.0.5 cross-compiles cleanly to linux/amd64 and linux/arm64 (verified,
  statically linked); the output name and downstream COPY are unchanged.

All five Go Dockerfiles now use the same cross-compile pattern.
…e-notes skills

- New educates-upgrade-cluster-services skill: upgrading the vendored
  upstream cluster-service charts (cert-manager, Contour, Kyverno,
  external-dns) the operator installs — the four-places-in-lockstep model,
  per-chart sources (incl. Kyverno's chart-vs-binary version resolution),
  vendor-charts/verify/test flow, and the reconciler re-verification step.
  The vendored-charts README gains the matching Kyverno version-resolution
  note so the two stay in agreement.
- educates-upgrade-go: add the installer/operator module (go.mod +
  Dockerfile), drop the removed tunnel-manager go.mod, reflect
  go-version-file-driven CI, and document the --platform=$BUILDPLATFORM
  cross-compile pattern (now on all five Go Dockerfiles).
- educates-release-notes: fix tag lookup (no 'v' prefix), replace
  'Upcoming Changes' with a Deprecations section, drop a stale path.
The client-programs and installer-operator CI workflows duplicated their
check lists in YAML, letting them drift from how the code is actually
built — three failures hid behind that gap. Make the Makefile the single
source of truth and have the workflows call it.

New root Makefile targets, mirrored 1:1 by the workflows:

- ci / ci-cli / ci-operator — go vet/build, drift checks, envtest, lint
  and chart-version lint — plus stage-renderer-files, factored out and
  reused by build-cli.
- client-programs-ci.yaml and installer-operator-ci.yaml now just run
  `make ci-cli` / `make ci-operator` after checkout + setup-go. The
  operator's separate chart-sync-lint job folds into ci-operator.

Fix the checks this surfaced, each previously masked by an earlier
failing step:

- Stage the gitignored renderer theme embed dir before go vet/build:
  hugo.go embeds it via //go:embed all:files/*, so a bare checkout
  matched nothing and client-programs CI failed every run.
- operator validator: read the referenced ClusterIssuer via APIReader
  in checkClusterIssuer. The deferred ClusterIssuer watch uses an
  unstructured informer while the read used the typed cached client —
  two independent caches — so on deletion the watch fired the reconcile
  but the stale typed read left status wedged at Ready, with nothing
  re-triggering (flaky "flips Ready to Degraded" envtest spec under
  load). Matches the existing checkCASecret APIReader precedent.
- client-programs CompressDirToFile discarded the os.Create error
  (errors.Errorf with no format verb, result not returned) — return a
  wrapped error instead.
- operator test files tripped golangci-lint once make lint finally ran:
  a goconst "latest" constant, three lll line-length wraps, and two
  modernize strings.SplitSeq updates.
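
The CompressDirToFile item in the list above comes down to returning the os.Create failure instead of dropping it. The sketch below shows that pattern with the standard library's fmt.Errorf and %w, which is a swapped-in technique: the original code used errors.Errorf, and the `createArchiveFile` helper and its message are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// createArchiveFile creates the output file and returns a wrapped error on
// failure, so callers can both read the context and inspect the cause.
func createArchiveFile(path string) (*os.File, error) {
	f, err := os.Create(path)
	if err != nil {
		return nil, fmt.Errorf("unable to create archive file %q: %w", path, err)
	}
	return f, nil
}

func main() {
	// The parent directory does not exist, so os.Create fails.
	target := filepath.Join(os.TempDir(), "no-such-dir-for-sketch", "out.tar.gz")
	_, err := createArchiveFile(target)
	fmt.Println("wrapped cause preserved:", errors.Is(err, fs.ErrNotExist))
	fmt.Println(err)
}
```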

Document the targets and the theme-staging / GOTOOLCHAIN gotchas in
developer-docs/build-instructions.md, the operator README, and CLAUDE.md.

Also bump GitHub Actions versions across the workflows (checkout,
setup-go, docker/*, cache, artifact, pages, gh-release).
@jorgemoralespou jorgemoralespou force-pushed the feature/4.0/helm-installer branch from 540adef to 2fd33cc Compare June 14, 2026 12:52